1.
bioRxiv ; 2024 Feb 29.
Article En | MEDLINE | ID: mdl-38464087

The gene expression profiles of distinct cell types reflect complex genomic interactions among multiple simultaneous biological processes within each cell that can be altered by disease progression as well as genetic background. The identification of these active cellular programs is an open challenge in the analysis of single-cell RNA-seq data. Latent Dirichlet Allocation (LDA) is a generative method used to identify recurring patterns in counts data, commonly referred to as topics, that can be used to interpret the state of each cell. However, LDA's interpretability is hindered by several key factors, including the hyperparameter selection of the number of topics as well as the variability in topic definitions due to random initialization. We developed Topyfic, a Reproducible LDA (rLDA) package, to accurately infer the identity and activity of cellular programs in single-cell data, providing insights into the relative contributions of each program in individual cells. We apply Topyfic to brain single-cell and single-nucleus datasets of two 5xFAD mouse models of Alzheimer's disease crossed with C57BL6/J or CAST/EiJ mice to identify cell types, and distinct states within cell types such as microglia. We find that 8-month 5xFAD/Cast F1 males show a higher level of microglial activation than matching 5xFAD/BL6 F1 males, whereas female mice show similar levels of microglial activation. We show that regulatory genes such as TFs, microRNA host genes, and chromatin regulatory genes alone capture cell types and cell states. Our study highlights how topic modeling with a limited vocabulary of regulatory genes can identify gene expression programs in single-cell data in order to quantify similar and divergent cell states in distinct genotypes.

2.
Nature ; 2023 Dec 06.
Article En | MEDLINE | ID: mdl-38057666

Human limbs emerge during the fourth post-conception week as mesenchymal buds, which develop into fully formed limbs over the subsequent months [1]. This process is orchestrated by numerous temporally and spatially restricted gene expression programmes, making congenital alterations in phenotype common [2]. Decades of work with model organisms have defined the fundamental mechanisms underlying vertebrate limb development, but an in-depth characterization of this process in humans has yet to be performed. Here we detail human embryonic limb development across space and time using single-cell and spatial transcriptomics. We demonstrate extensive diversification of cells from a few multipotent progenitors to myriad differentiated cell states, including several novel cell populations. We uncover two waves of human muscle development, each characterized by different cell states regulated by separate gene expression programmes, and identify musculin (MSC) as a key transcriptional repressor maintaining muscle stem cell identity. Through assembly of multiple anatomically continuous spatial transcriptomic samples using VisiumStitcher, we map cells across a sagittal section of a whole fetal hindlimb. We reveal a clear anatomical segregation between genes linked to brachydactyly and polysyndactyly, and uncover transcriptionally and spatially distinct populations of the mesenchyme in the autopod. Finally, we perform single-cell RNA sequencing on mouse embryonic limbs to facilitate cross-species developmental comparison, finding substantial homology between the two species.

3.
Genome Res ; 2023 Oct 18.
Article En | MEDLINE | ID: mdl-37852782

Transcription factors (TFs) are trans-acting proteins that bind cis-regulatory elements (CREs) in DNA to control gene expression. Here, we analyzed the genomic localization profiles of 529 sequence-specific TFs and 151 cofactors and chromatin regulators in the human cancer cell line HepG2, for a total of 680 broadly termed DNA-associated proteins (DAPs). We used this deep collection to model each TF's impact on gene expression, and identified a cohort of 26 candidate transcriptional repressors. We examine high occupancy target (HOT) sites in the context of three-dimensional genome organization and show biased motif placement in distal-promoter connections involving HOT sites. We also found a substantial number of closed chromatin regions with multiple DAPs bound, and explored their properties, finding that a MAFF/MAFK TF pair correlates with transcriptional repression. Altogether, these analyses provide novel insights into the regulatory logic of the human cell line HepG2 genome and show the usefulness of large genomic analyses for elucidation of individual TF functions.

4.
bioRxiv ; 2023 May 16.
Article En | MEDLINE | ID: mdl-37292896

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
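The triplet idea in the abstract above can be illustrated with a small sketch. It assumes each transcript is labelled by a hypothetical (start site, exon junction chain, end site) identifier triple, and it uses a naive normalization of the three unique counts to place a gene on a simplex. The normalization and all names here are assumptions for illustration; the framework described in the abstract may compute its coordinates differently.

```python
from collections import namedtuple

# Hypothetical container: one triplet per transcript of a gene.
Triplet = namedtuple("Triplet", ["tss", "ec", "tes"])


def simplex_coords(triplets):
    """Return (tss, ec, tes) proportions that sum to 1 for one gene.

    Each coordinate is the number of unique start sites, exon junction
    chains, or end sites, divided by the sum of the three counts.
    This normalization is an illustrative assumption.
    """
    n_tss = len({t.tss for t in triplets})
    n_ec = len({t.ec for t in triplets})
    n_tes = len({t.tes for t in triplets})
    total = n_tss + n_ec + n_tes
    return (n_tss / total, n_ec / total, n_tes / total)


# Toy gene whose three transcripts differ only in their start site.
gene = [Triplet(1, 1, 1), Triplet(2, 1, 1), Triplet(3, 1, 1)]
coords = simplex_coords(gene)
print(coords)
```

In this toy case the gene lands nearest the start-site corner of the simplex, which is the kind of bias toward one diversity mechanism that the abstract describes.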

5.
Bioinformatics ; 39(4)2023 04 03.
Article En | MEDLINE | ID: mdl-36897015

SUMMARY: Large-scale sharing of genomic quantification data requires standardized access interfaces. In this Global Alliance for Genomics and Health project, we developed RNAget, an API for secure access to genomic quantification data in matrix form. RNAget provides for slicing matrices to extract desired subsets of data and is applicable to all expression matrix-format data, including RNA sequencing and microarrays. Further, it generalizes to quantification matrices of other sequence-based genomics such as ATAC-seq and ChIP-seq. AVAILABILITY AND IMPLEMENTATION: https://ga4gh-rnaseq.github.io/schema/docs/index.html.
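To make the idea of slicing a quantification matrix concrete, the sketch below extracts a subset of features and samples from a small in-memory matrix. It does not call RNAget or reproduce its interface; the function name, parameters, and data are hypothetical and only illustrate the operation the abstract describes.

```python
def slice_matrix(matrix, feature_ids, sample_ids,
                 keep_features=None, keep_samples=None):
    """Return the sub-matrix restricted to the requested features and samples.

    A filter left as None keeps every row (or column) on that axis.
    """
    rows = [i for i, f in enumerate(feature_ids)
            if keep_features is None or f in keep_features]
    cols = [j for j, s in enumerate(sample_ids)
            if keep_samples is None or s in keep_samples]
    return {
        "features": [feature_ids[i] for i in rows],
        "samples": [sample_ids[j] for j in cols],
        "values": [[matrix[i][j] for j in cols] for i in rows],
    }


# Toy expression matrix: rows are genes, columns are samples (made-up values).
features = ["GENE1", "GENE2", "GENE3"]
samples = ["S1", "S2", "S3", "S4"]
values = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

subset = slice_matrix(values, features, samples,
                      keep_features={"GENE1", "GENE3"},
                      keep_samples={"S2", "S4"})
print(subset["values"])
```

Returning the surviving row and column labels together with the values keeps the slice self-describing, which matters when only part of a large matrix is transferred.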


RNA , Software , Genomics , Genome , Sequence Analysis, RNA
6.
Genome Biol ; 22(1): 286, 2021 10 07.
Article En | MEDLINE | ID: mdl-34620214

The rise in throughput and quality of long-read sequencing should allow unambiguous identification of full-length transcript isoforms. However, its application to single-cell RNA-seq has been limited by throughput and expense. Here we develop and characterize long-read Split-seq (LR-Split-seq), which uses combinatorial barcoding to sequence single cells with long reads. Applied to the C2C12 myogenic system, LR-Split-seq associates isoforms to cell types with relative economy and design flexibility. We find widespread evidence of changing isoform expression during differentiation including alternative transcription start sites (TSS) and/or alternative internal exon usage. LR-Split-seq provides an affordable method for identifying cluster-specific isoforms in single cells.


RNA Isoforms/metabolism , RNA-Seq/methods , Single-Cell Analysis/methods , Animals , Cell Differentiation/genetics , Cell Line , Cell Nucleus/genetics , Chromatin/metabolism , Genomics , Mice , Models, Genetic , Myogenin/genetics , PAX7 Transcription Factor/genetics , Transcription Initiation Site , Transcription, Genetic
7.
Nature ; 583(7818): 720-728, 2020 07.
Article En | MEDLINE | ID: mdl-32728244

Transcription factors are DNA-binding proteins that have key roles in gene regulation [1,2]. Genome-wide occupancy maps of transcriptional regulators are important for understanding gene regulation and its effects on diverse biological processes [3-6]. However, only a minority of the more than 1,600 transcription factors encoded in the human genome has been assayed. Here we present, as part of the ENCODE (Encyclopedia of DNA Elements) project, data and analyses from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experiments using the human HepG2 cell line for 208 chromatin-associated proteins (CAPs). These comprise 171 transcription factors and 37 transcriptional cofactors and chromatin regulator proteins, and represent nearly one-quarter of CAPs expressed in HepG2 cells. The binding profiles of these CAPs form major groups associated predominantly with promoters or enhancers, or with both. We confirm and expand the current catalogue of DNA sequence motifs for transcription factors, and describe motifs that correspond to other transcription factors that are co-enriched with the primary ChIP target. For example, FOX family motifs are enriched in ChIP-seq peaks of 37 other CAPs. We show that motif content and occupancy patterns can distinguish between promoters and enhancers. This catalogue reveals high-occupancy target regions at which many CAPs associate, although each contains motifs for only a minority of the numerous associated transcription factors. These analyses provide a more complete overview of the gene regulatory networks that define this cell type, and demonstrate the usefulness of the large-scale production efforts of the ENCODE Consortium.


Chromatin Immunoprecipitation Sequencing , Chromatin/genetics , Chromatin/metabolism , DNA-Binding Proteins/metabolism , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid/genetics , Datasets as Topic , Enhancer Elements, Genetic/genetics , Hep G2 Cells , Humans , Nucleotide Motifs/genetics , Promoter Regions, Genetic/genetics , Protein Binding , Transcription Factors/metabolism
8.
Nature ; 583(7818): 760-767, 2020 07.
Article En | MEDLINE | ID: mdl-32728245

During mammalian embryogenesis, differential gene expression gradually builds the identity and complexity of each tissue and organ system [1]. Here we systematically quantified mouse polyA-RNA from day 10.5 of embryonic development to birth, sampling 17 tissues and organs. The resulting developmental transcriptome is globally structured by dynamic cytodifferentiation, body-axis and cell-proliferation gene sets that were further characterized by the transcription factor motif codes of their promoters. We decomposed the tissue-level transcriptome using single-cell RNA-seq (sequencing of RNA reverse transcribed into cDNA) and found that neurogenesis and haematopoiesis dominate at both the gene and cellular levels, jointly accounting for one-third of differential gene expression and more than 40% of identified cell types. By integrating promoter sequence motifs with companion ENCODE epigenomic profiles, we identified a prominent promoter de-repression mechanism in neuronal expression clusters that was attributable to known and novel repressors. Focusing on the developing limb, single-cell RNA data identified 25 candidate cell types that included progenitor and differentiating states with computationally inferred lineage relationships. We extracted cell-type transcription factor networks and complementary sets of candidate enhancer elements by using single-cell RNA-seq to decompose integrative cis-element (IDEAS) models that were derived from whole-tissue epigenome chromatin data. These ENCODE reference data, computed network components and IDEAS chromatin segmentations are companion resources to the matching epigenomic developmental matrix, and are available for researchers to further mine and integrate.


Embryo, Mammalian/cytology , Embryo, Mammalian/embryology , Embryonic Development/genetics , Gene Expression Regulation, Developmental , Single-Cell Analysis , Transcriptome , Animals , Cell Differentiation/genetics , Cell Lineage/genetics , Chromatin/genetics , Embryo, Mammalian/metabolism , Enhancer Elements, Genetic , Epigenomics , Extremities/embryology , Female , Male , Mice , Poly A/genetics , Poly A/metabolism , Promoter Regions, Genetic , RNA-Seq , Transcription Factors/metabolism
9.
Genome Res ; 29(11): 1900-1909, 2019 11.
Article En | MEDLINE | ID: mdl-31645363

MicroRNAs (miRNAs) play a critical role as posttranscriptional regulators of gene expression. The ENCODE Project profiled the expression of miRNAs in an extensive set of organs during a time-course of mouse embryonic development and captured the expression dynamics of 785 miRNAs. We found distinct organ-specific and developmental stage-specific miRNA expression clusters, with an overall pattern of increasing organ-specific expression as embryonic development proceeds. Comparative analysis of conserved miRNAs in mouse and human revealed stronger clustering of expression patterns by organ type rather than by species. An analysis of messenger RNA expression clusters compared with miRNA expression clusters identifies the potential role of specific miRNA expression clusters in suppressing the expression of mRNAs specific to other developmental programs in the organ in which these miRNAs are expressed during embryonic development. Our results provide the most comprehensive time-course of miRNA expression as part of an integrated ENCODE reference data set for mouse embryonic development.


Embryonic Development/genetics , MicroRNAs/genetics , Animals , Female , Gene Expression Regulation, Developmental , Mice , Pregnancy , RNA, Messenger/genetics
10.
Cell Syst ; 9(4): 321-337.e9, 2019 10 23.
Article En | MEDLINE | ID: mdl-31629685

Intrathymic T cell development converts multipotent precursors to committed pro-T cells, silencing progenitor genes while inducing T cell genes, but the underlying steps have remained obscure. Single-cell profiling was used to define the order of regulatory changes, employing single-cell RNA sequencing (scRNA-seq) for full-transcriptome analysis, plus sequential multiplexed single-molecule fluorescent in situ hybridization (seqFISH) to quantitate functionally important transcripts in intrathymic precursors. Single-cell cloning verified high T cell precursor frequency among the immunophenotypically defined "early T cell precursor" (ETP) population; a discrete committed granulocyte precursor subset was also distinguished. We established regulatory phenotypes of sequential ETP subsets, confirmed initial co-expression of progenitor with T cell specification genes, defined stage-specific relationships between cell cycle and differentiation, and generated a pseudotime model from ETP to T lineage commitment, supported by RNA velocity and transcription factor perturbations. This model was validated by developmental kinetics of ETP subsets at population and clonal levels. The results imply that multilineage priming is integral to T cell specification.


Models, Immunological , Pluripotent Stem Cells/physiology , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , T-Lymphocytes/physiology , Thymus Gland/physiology , Cell Differentiation , Cell Lineage , Gene Expression Profiling , Gene Expression Regulation , Gene Silencing , In Situ Hybridization, Fluorescence
11.
Proc Natl Acad Sci U S A ; 115(13): E2930-E2939, 2018 03 27.
Article En | MEDLINE | ID: mdl-29531064

RNA-sequencing (RNA-seq) is commonly used to identify genetic modules that respond to perturbations. In single cells, transcriptomes have been used as phenotypes, but this concept has not been applied to whole-organism RNA-seq. Also, quantifying and interpreting epistatic effects using expression profiles remains a challenge. We developed a single coefficient to quantify transcriptome-wide epistasis that reflects the underlying interactions and which can be interpreted intuitively. To demonstrate our approach, we sequenced four single and two double mutants of Caenorhabditis elegans. From these mutants, we reconstructed the known hypoxia pathway. In addition, we uncovered a class of 56 genes with HIF-1-dependent expression that have opposite changes in expression in mutants of two genes that cooperate to negatively regulate HIF-1 abundance; however, the double mutant of these genes exhibits suppression epistasis. This class violates the classical model of HIF-1 regulation but can be explained by postulating a role of hydroxylated HIF-1 in transcriptional control.


Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Epistasis, Genetic , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/methods , Transcriptome , Animals , Caenorhabditis elegans/growth & development
12.
Dev Cell ; 32(6): 765-71, 2015 Mar 23.
Article En | MEDLINE | ID: mdl-25805138

Huang et al. (2013) recently reported that chromatin immunoprecipitation sequencing (ChIP-seq) reveals the genome-wide sites of occupancy by Piwi, a piRNA-guided Argonaute protein central to transposon silencing in Drosophila. Their study also reported that loss of Piwi causes widespread rewiring of transcriptional patterns, as evidenced by changes in RNA polymerase II occupancy across the genome. Here we reanalyze their data and report that the underlying deep-sequencing dataset does not support the authors' genome-wide conclusions.


Argonaute Proteins/genetics , DNA-Binding Proteins/genetics , Drosophila Proteins/genetics , RNA Polymerase II/genetics , Animals , Base Sequence , Binding Sites/genetics , Chromatin Immunoprecipitation , Drosophila melanogaster , Genome , High-Throughput Nucleotide Sequencing , Methyltransferases , RNA Interference , RNA, Small Interfering/genetics , Sequence Analysis, DNA
13.
Cell Stem Cell ; 16(1): 88-101, 2015 Jan 08.
Article En | MEDLINE | ID: mdl-25575081

Cellular reprogramming highlights the epigenetic plasticity of the somatic cell state. Long noncoding RNAs (lncRNAs) have emerging roles in epigenetic regulation, but their potential functions in reprogramming cell fate have been largely unexplored. We used single-cell RNA sequencing to characterize the expression patterns of over 16,000 genes, including 437 lncRNAs, during defined stages of reprogramming to pluripotency. Self-organizing maps (SOMs) were used as an intuitive way to structure and interrogate transcriptome data at the single-cell level. Early molecular events during reprogramming involved the activation of Ras signaling pathways, along with hundreds of lncRNAs. Loss-of-function studies showed that activated lncRNAs can repress lineage-specific genes, while lncRNAs activated in multiple reprogramming cell types can regulate metabolic gene expression. Our findings demonstrate that reprogramming cells activate defined sets of functionally relevant lncRNAs and provide a resource to further investigate how dynamic changes in the transcriptome reprogram cell state.


Cellular Reprogramming/genetics , RNA, Long Noncoding/genetics , Single-Cell Analysis/methods , Transcriptome/genetics , Animals , Cell Lineage/genetics , Gene Expression Regulation, Developmental , Genes, Developmental , Hematopoiesis/genetics , Induced Pluripotent Stem Cells/cytology , Induced Pluripotent Stem Cells/metabolism , Mice , Pluripotent Stem Cells/metabolism , RNA, Long Noncoding/metabolism , Signal Transduction/genetics , ras Proteins/metabolism
14.
BMC Bioinformatics ; 15: 331, 2014 Nov 20.
Article En | MEDLINE | ID: mdl-25411051

BACKGROUND: Gene co-expression analysis has previously been based on measures that include correlation coefficients and mutual information, as well as newcomers such as MIC. These measures depend primarily on the degree of association between the RNA levels of two genes and to a lesser extent on their variability. They focus on the similarity of expression value trajectories that change in like manner across samples. However, there are relationships of biological interest for which these classical measures are expected to be insensitive. These include genes whose expression levels are ratiometrically stable and genes whose variance is tightly constrained. Large-scale studies of relatively homogeneous samples, including single cell RNA-seq, are experimental settings in which such relationships might be especially pertinent. RESULTS: We develop and implement a ratiometric approach for detecting gene associations (abbreviated RA). It is based on the coefficient of variation of the measured expression ratio of each pair of genes. We apply it to a collection of lymphoblastoid RNA-seq data from the 1000 Genomes Project Consortium, a typical sample set with high overall homogeneity. RA is a selective method, reporting in this case ~1/4 of all possible gene pairs, yet these relationships include a distilled picture of biological relationships previously found by other methods. In addition, RA reveals expression relationships that are not detected by traditional correlation and mutual information methods. We also analyze data from individual lymphoblastoid cells and show that desirable properties of the RA method extend to single-cell RNA-seq. CONCLUSION: We show that our ratiometric method identifies biologically significant relationships that are often missed or low-ranked by conventional association-based methods when applied to a relatively homogeneous dataset. The results open new questions about the regulatory mechanisms that produce strong RA relationships. RA is scalable and potentially well suited for the analysis of thousands of bulk-RNA or single-cell transcriptomes.
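The core quantity named in the abstract above, the coefficient of variation of the expression ratio of a gene pair, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: it uses the population standard deviation, assumes all expression values are non-zero, and the `max_cv` cutoff of 0.1 and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def ratio_cv(expr_a, expr_b):
    """Coefficient of variation of the per-sample ratio a/b.

    Assumes both genes have non-zero expression in every sample.
    """
    ratios = [a / b for a, b in zip(expr_a, expr_b)]
    return pstdev(ratios) / mean(ratios)


def ratiometric_pairs(expression, max_cv=0.1):
    """Return (gene1, gene2, cv) for pairs whose ratio CV is at most max_cv.

    max_cv is an illustrative threshold, not a value from the source.
    """
    hits = []
    for g1, g2 in combinations(sorted(expression), 2):
        cv = ratio_cv(expression[g1], expression[g2])
        if cv <= max_cv:
            hits.append((g1, g2, cv))
    return hits


# Toy data: geneB is always twice geneA; geneC varies independently.
toy = {
    "geneA": [10.0, 20.0, 30.0, 40.0],
    "geneB": [20.0, 40.0, 60.0, 80.0],
    "geneC": [5.0, 50.0, 8.0, 90.0],
}
print(ratiometric_pairs(toy))
```

Here geneA and geneB vary fourfold across samples yet keep a constant ratio, so the pair is reported with a CV of zero, while the pairs involving geneC are rejected.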


Gene Expression Profiling/methods , Genetic Association Studies/methods , Sequence Analysis, RNA , Single-Cell Analysis , B-Lymphocytes/metabolism , Cell Line, Transformed , Human Genome Project , Humans
15.
Sci Rep ; 4: 5152, 2014 Jun 12.
Article En | MEDLINE | ID: mdl-24919486

Chromatin immunoprecipitation coupled with DNA sequencing (ChIP-seq) is the major contemporary method for mapping in vivo protein-DNA interactions in the genome. It identifies sites of transcription factor, cofactor and RNA polymerase occupancy, as well as the distribution of histone marks. Consortia such as the ENCyclopedia Of DNA Elements (ENCODE) have produced large datasets using manual protocols. However, future measurements of hundreds of additional factors in many cell types and physiological states call for higher throughput and consistency afforded by automation. Such automation advances, when provided by multiuser facilities, could also improve the quality and efficiency of individual small-scale projects. The immunoprecipitation process has become rate-limiting, and is a source of substantial variability when performed manually. Here we report a fully automated robotic ChIP (R-ChIP) pipeline that allows up to 96 reactions. A second bottleneck is the dearth of renewable ChIP-validated immune reagents, which do not yet exist for most mammalian transcription factors. We used R-ChIP to screen new mouse monoclonal antibodies raised against p300, a histone acetylase, well-known as a marker of active enhancers, for which ChIP-competent monoclonal reagents have been lacking. We identified, validated for ChIP-seq, and made publicly available a monoclonal reagent called ENCITp300-1.


Antibodies, Monoclonal/metabolism , Chromatin Immunoprecipitation/methods , E1A-Associated p300 Protein/metabolism , Protein Interaction Mapping/methods , Sequence Analysis, DNA/methods , Animals , Automation/methods , Histone Acetyltransferases/metabolism , Histones/metabolism , Mammals , Mice , Robotics , Transcription Factors/metabolism
16.
PLoS One ; 9(1): e84713, 2014.
Article En | MEDLINE | ID: mdl-24465428

Mitochondria contain their own circular genome, with mitochondria-specific transcription and replication systems and corresponding regulatory proteins. All of these proteins are encoded in the nuclear genome and are post-translationally imported into mitochondria. In addition, several nuclear transcription factors have been reported to act in mitochondria, but there has been no comprehensive mapping of their occupancy patterns and it is not clear how many other factors may also be found in mitochondria. Here we address these questions by using ChIP-seq data from the ENCODE, mouseENCODE and modENCODE consortia for 151 human, 31 mouse and 35 C. elegans factors. We identified 8 human and 3 mouse transcription factors with strong localized enrichment over the mitochondrial genome that was usually associated with the corresponding recognition sequence motif. Notably, these sites of occupancy are often the sites with highest ChIP-seq signal intensity within both the nuclear and mitochondrial genomes and are thus best explained as true binding events to mitochondrial DNA, which exists in high copy number in each cell. We corroborated these findings by immunocytochemical staining evidence for mitochondrial localization. However, we were unable to find clear evidence for mitochondrial binding in ENCODE and other publicly available ChIP-seq data for most factors previously reported to localize there. As the first global analysis of nuclear transcription factors binding in mitochondria, this work opens the door to future studies that probe the functional significance of the phenomenon.


Genome, Mitochondrial/genetics , Transcription Factors/metabolism , Animals , Computational Biology , Humans , Mice , Transcription Factors/genetics
17.
Genome Res ; 24(3): 496-510, 2014 Mar.
Article En | MEDLINE | ID: mdl-24299736

Single-cell RNA-seq mammalian transcriptome studies are at an early stage in uncovering cell-to-cell variation in gene expression, transcript processing and editing, and regulatory module activity. Despite great progress recently, substantial challenges remain, including discriminating biological variation from technical noise. Here we apply the SMART-seq single-cell RNA-seq protocol to study the reference lymphoblastoid cell line GM12878. By using spike-in quantification standards, we estimate the absolute number of RNA molecules per cell for each gene and find significant variation in total mRNA content: between 50,000 and 300,000 transcripts per cell. We directly measure technical stochasticity by a pool/split design and find that there are significant differences in expression between individual cells, over and above technical variation. Specific gene coexpression modules were preferentially expressed in subsets of individual cells, including one enriched for mRNA processing and splicing factors. We assess cell-to-cell variation in alternative splicing and allelic bias and report evidence of significant differences in splice site usage that exceed splice variation in the pool/split comparison. Finally, we show that transcriptomes from small pools of 30-100 cells approach the information content and reproducibility of contemporary RNA-seq from large amounts of input material. Together, our results define an experimental and computational path forward for analyzing gene expression in rare cell types and cell states.


Gene Expression Profiling/methods , Genes , RNA Splicing , RNA/analysis , Cell Line, Tumor , Genome, Human , Humans , RNA/genetics , Reproducibility of Results , Sequence Analysis, RNA , Transcriptome
18.
G3 (Bethesda) ; 4(2): 209-23, 2014 Feb 19.
Article En | MEDLINE | ID: mdl-24347632

ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.


Chromatin Immunoprecipitation/standards , Databases, Genetic/standards , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, DNA/standards , Animals , Data Interpretation, Statistical , High-Throughput Nucleotide Sequencing/methods , MyoD Protein/genetics , Quality Control , Sequence Analysis, DNA/methods , Transcription Factors/genetics
19.
Genome Res ; 23(12): 2136-48, 2013 Dec.
Article En | MEDLINE | ID: mdl-24170599

We tested whether self-organizing maps (SOMs) could be used to effectively integrate, visualize, and mine diverse genomics data types, including complex chromatin signatures. A fine-grained SOM was trained on 72 ChIP-seq histone modifications and DNase-seq data sets from six biologically diverse cell lines studied by The ENCODE Project Consortium. We mined the resulting SOM to identify chromatin signatures related to sequence-specific transcription factor occupancy, sequence motif enrichment, and biological functions. To highlight clusters enriched for specific functions such as transcriptional promoters or enhancers, we overlaid onto the map additional data sets not used during training, such as ChIP-seq, RNA-seq, CAGE, and information on cis-acting regulatory modules from the literature. We used the SOM to parse known transcriptional enhancers according to the cell-type-specific chromatin signature, and we further corroborated this pattern on the map by EP300 (also known as p300) occupancy. New candidate cell-type-specific enhancers were identified for multiple ENCODE cell types in this way, along with new candidates for ubiquitous enhancer activity. An interactive web interface was developed to allow users to visualize and custom-mine the ENCODE SOM. We conclude that large SOMs trained on chromatin data from multiple cell types provide a powerful way to identify complex relationships in genomic data at user-selected levels of granularity.


Chromatin/genetics , Chromatin/metabolism , Histones/genetics , Histones/metabolism , Transcription Factors/genetics , Algorithms , Cell Line , Chromosome Mapping , Computational Biology , Data Mining , Gene Ontology , Human Umbilical Vein Endothelial Cells , Humans , K562 Cells , Promoter Regions, Genetic , User-Computer Interface
20.
PLoS One ; 8(8): e74513, 2013.
Article En | MEDLINE | ID: mdl-23991223

Mitochondria contain a 16.6 kb circular genome encoding 13 proteins as well as mitochondrial tRNAs and rRNAs. Copies of the genome are organized into nucleoids containing both DNA and proteins, including the machinery required for mtDNA replication and transcription. The transcription factor TFAM is critical for initiation of transcription and replication of the genome, and is also thought to perform a packaging function. Although specific binding sites required for initiation of transcription have been identified in the D-loop, little is known about the characteristics of TFAM binding in its nonspecific packaging state. In addition, it is unclear whether TFAM also plays a role in the regulation of nuclear gene expression. Here we investigate these questions by using ChIP-seq to directly localize TFAM binding to DNA in human cells. Our results demonstrate that TFAM uniformly coats the whole mitochondrial genome, with no evidence of robust TFAM binding to the nuclear genome. Our study represents the first high-resolution assessment of TFAM binding on a genome-wide scale in human cells.


DNA, Mitochondrial/genetics , DNA-Binding Proteins/genetics , Genome, Human , Mitochondrial Proteins/genetics , Transcription Factors/genetics , Cell Nucleus/genetics , Chromatin Immunoprecipitation , DNA-Binding Proteins/immunology , HeLa Cells , Humans , Mitochondrial Proteins/immunology , Transcription Factors/immunology
...